 training time


First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training

Neural Information Processing Systems

As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block's MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by these observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input and improve model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18×, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing training time relative to the baseline. Code is available at: https://casl-ku.github.io/FAL/
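To make the data flow described in this abstract concrete, here is a minimal NumPy sketch of a FAL-like forward pass. It is an illustration under assumptions, not the authors' implementation: the single-head attention, the ReLU MLP, the helper names (`make_layers`, `fal_like_forward`), and the exact way the first attention output is combined with the residual stream before the MLP are all hypothetical. The only property it is meant to show is that, after the first layer, the MLP input no longer depends on the same block's MHA output, so the two branches could run in parallel.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free layer normalization over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, wq, wk, wv, wo):
    # Single-head, unmasked self-attention (simplified stand-in for MHA).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ wo

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

def make_layers(rng, n_layers, d, d_ff):
    # Hypothetical helper: small random weights for each block.
    layers = []
    for _ in range(n_layers):
        attn = tuple(0.1 * rng.normal(size=(d, d)) for _ in range(4))
        ff = (0.1 * rng.normal(size=(d, d_ff)), 0.1 * rng.normal(size=(d_ff, d)))
        layers.append({"attn": attn, "mlp": ff})
    return layers

def fal_like_forward(x, layers):
    first_attn = None
    for i, p in enumerate(layers):
        attn_out = attention(layer_norm(x), *p["attn"])
        if i == 0:
            # First block keeps the usual sequential MHA -> MLP path.
            first_attn = attn_out
            x = x + attn_out
            x = x + mlp(layer_norm(x), *p["mlp"])
        else:
            # Assumed combination: the MLP sees the block input plus the
            # FIRST layer's attention output, not this block's attn_out,
            # so attn_out and the MLP branch are independent of each other.
            mlp_out = mlp(layer_norm(x + first_attn), *p["mlp"])
            x = x + attn_out + mlp_out
    return x
```

In a tensor-parallel setting, the point of such a layout would be that the partial results of the two independent branches can be summed once per block rather than synchronized between MHA and MLP; the sketch above does not model communication at all.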


A Faster Training Algorithm for Regression Trees with Linear Leaves, and an Analysis of its Complexity

Neural Information Processing Systems

We consider the Tree Alternating Optimization (TAO) algorithm to train regression trees with linear predictors in the leaves. Unlike traditional greedy recursive partitioning algorithms such as CART, TAO guarantees a monotonic decrease of the objective function and results in smaller trees of much better accuracy. We modify the TAO algorithm so that it produces exactly the same result but is much faster, particularly for high input dimensionality or deep trees. The idea is based on the fact that, at each iteration of TAO, each leaf receives only a subset of the training instances. Thus, the optimization of the leaf model can be done exactly but faster by using the Sherman-Morrison-Woodbury formula. This has the unexpected advantage that, once a tree exceeds a critical depth, making it deeper makes it faster to train, even though the tree is larger and has more parameters. Indeed, this can make learning a nonlinear model (the tree) asymptotically faster than a regular linear regression model. We analyze the corresponding computational complexity and verify the speedups experimentally on various datasets. The argument can be applied to other types of trees whenever the optimization of a node takes time superlinear in the number of instances.
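As a rough illustration of why a leaf that receives few instances can be fit cheaply, the sketch below uses a standard consequence of the Sherman-Morrison-Woodbury formula for ridge regression: when a leaf holds n instances in d dimensions with n < d, the d×d solve can be replaced by an n×n solve that gives the same weights. The ridge objective, the regularization parameter `lam`, and the function names are assumptions for illustration; the paper's exact leaf objective and update may differ.

```python
import numpy as np

def ridge_primal(X, y, lam):
    # Direct solve of (X^T X + lam I) w = X^T y: a d x d system.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_woodbury(X, y, lam):
    # Equivalent form w = X^T (X X^T + lam I)^{-1} y: an n x n system,
    # cheaper than the primal form when n (instances in the leaf) < d.
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))   # 20 instances reach this leaf, 200 features
y = rng.normal(size=20)
w_primal = ridge_primal(X, y, lam=1.0)
w_small = ridge_woodbury(X, y, lam=1.0)
same = np.allclose(w_primal, w_small)
```

Deeper trees spread the training set over more leaves, so each leaf sees a smaller n, which is consistent with the abstract's observation that depth can reduce training time.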


Deep Legendre Transform

Neural Information Processing Systems

We introduce a novel deep learning algorithm for computing convex conjugates of differentiable convex functions, a fundamental operation in convex analysis with various applications in different fields such as optimization, control theory, physics and economics. While traditional numerical methods suffer from the curse of dimensionality and become computationally intractable in high dimensions, more recent neural network-based approaches scale better, but have mostly been studied with the aim of solving optimal transport problems and require the solution of complicated optimization or max-min problems. Using an implicit Fenchel formulation of convex conjugation, our approach facilitates an efficient gradient-based framework for the minimization of approximation errors and, as a byproduct, also provides a posteriori estimates of the approximation accuracy. Numerical experiments demonstrate our method's ability to deliver accurate results across different high-dimensional examples. Moreover, by employing symbolic regression with Kolmogorov-Arnold networks, it is able to obtain the exact convex conjugates of specific convex functions.
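For readers unfamiliar with the operation, the convex conjugate of f is f*(y) = sup_x (<x, y> - f(x)), and for a differentiable convex f the supremum is attained where y = ∇f(x), giving f*(∇f(x)) = <x, ∇f(x)> - f(x). The sketch below checks this identity numerically on f(x) = Σ exp(x_i), whose conjugate is known in closed form. The example function and helper names are chosen here for illustration; how the paper turns this identity into a training loss is not shown and is not assumed.

```python
import numpy as np

def f(x):
    # Example convex function: f(x) = sum_i exp(x_i).
    return np.exp(x).sum(axis=-1)

def grad_f(x):
    return np.exp(x)

def conj_closed_form(y):
    # Known conjugate for y > 0: f*(y) = sum_i (y_i log y_i - y_i).
    return (y * np.log(y) - y).sum(axis=-1)

def conj_via_gradient(x):
    # Fenchel identity: at y = grad f(x), f*(y) = <x, y> - f(x).
    y = grad_f(x)
    return y, (x * y).sum(axis=-1) - f(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 50))          # 1000 sample points in 50 dimensions
y, values = conj_via_gradient(x)
max_err = np.abs(values - conj_closed_form(y)).max()
```

Sampling x and mapping it through the gradient yields exact (y, f*(y)) pairs without solving any inner maximization, which is what makes gradient-based formulations attractive in high dimensions.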


You Only Once Taking a Spectral Detour to Accelerate Graph Neural Network

Neural Information Processing Systems

Training Graph Neural Networks (GNNs) often relies on repeated, irregular, and expensive message-passing operations over all nodes (e.g., N), leading to high computational overhead. To alleviate this inefficiency, we revisit GNN training from the spectral perspective.


59ea33ae3d096f3bcd5026b479710cf8-Paper-Conference.pdf

Neural Information Processing Systems

Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.


Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Neural Information Processing Systems

Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies.
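As background for the data packing that this abstract contrasts itself with, the sketch below shows a plain first-fit-decreasing packer that groups variable-length samples into fixed-capacity packs, plus a rough per-pack attention cost. This is a generic baseline, not HBP: the function names, the `capacity` parameter, and the sum-of-squares cost proxy (assuming attention is restricted to within each sample) are assumptions made for illustration.

```python
def pack_first_fit_decreasing(lengths, capacity):
    # Place each sample (longest first) into the first pack with room,
    # opening a new pack when none fits. Returns lists of sample indices.
    packs = []  # each entry: [remaining_tokens, [sample indices]]
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for idx in order:
        n = lengths[idx]
        if n > capacity:
            raise ValueError(f"sample {idx} has {n} tokens, capacity is {capacity}")
        for pack in packs:
            if pack[0] >= n:
                pack[0] -= n
                pack[1].append(idx)
                break
        else:
            packs.append([capacity - n, [idx]])
    return [pack[1] for pack in packs]

def attention_cost(pack, lengths):
    # Rough proxy: attention within each sample scales with length squared.
    return sum(lengths[i] ** 2 for i in pack)

lengths = [7, 3, 5, 2, 8, 1]
packs = pack_first_fit_decreasing(lengths, capacity=10)
costs = [attention_cost(p, lengths) for p in packs]
```

Packs that hold similar token counts can still differ widely in this cost proxy when one contains a single long sample and another contains many short ones, which is the kind of attention imbalance the abstract says token-level packing overlooks.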


[Figure residue in place of the title. Panel (a): 117 frames (default length). Dense Attention, latency 1649 s; Radial Attention (Ours), latency 876 s (1.9× faster), PSNR 27.3.]

Neural Information Processing Systems

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increases, [...] over space and time in nature. [...] scalable sparse attention [...]


LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

Neural Information Processing Systems

Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions.


First Attentions Last: Better Exploiting First Attentions for Efficient Parallel Training

Neural Information Processing Systems

As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block's MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by these observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input and improve model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18×, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing training time relative to the baseline. Code is available at: https://casl-ku.github.io/FAL/


GSRF: Complex-Valued 3D Gaussian Splatting for Efficient Radio-Frequency Data Synthesis

Neural Information Processing Systems

Synthesizing radio-frequency (RF) data given the transmitter and receiver positions, e.g., received signal strength indicator (RSSI), is critical for wireless networking and sensing applications, such as indoor localization. However, it remains challenging due to complex propagation interactions, including reflection, diffraction, and scattering. State-of-the-art neural radiance field (NeRF)-based methods achieve high-fidelity RF data synthesis but are limited by long training times and high inference latency. We introduce GSRF, a framework that extends 3D Gaussian Splatting (3DGS) from the optical domain to the RF domain, enabling efficient RF data synthesis. GSRF realizes this adaptation through three key innovations: First, it introduces complex-valued 3D Gaussians with a hybrid Fourier-Legendre basis to model directional and phase-dependent radiance. Second, it employs orthographic splatting for efficient ray-Gaussian intersection identification. Third, it incorporates a complex-valued ray tracing algorithm, executed on RF-customized CUDA kernels and grounded in wavefront propagation principles, to synthesize RF data in real time. Evaluated across various RF technologies, GSRF preserves high-fidelity RF data synthesis while achieving significant improvements in training efficiency, shorter training time, and reduced inference latency.